Databricks Templatized Data Integration Jobs
You can use Databricks for data integration in a data pipeline in the Lazsa Platform. With templates available for different combinations of source and target nodes, you can create a Databricks data integration job in a few clicks. You can perform append or overwrite operations on the data and create data partitions to optimize data querying. After performing the specified operation on the data, you can store rejected records at a specific location.
Complete the following steps to create a Databricks data integration job:
Provide job details for the data integration job:
- Based on the source and destination that you choose in the data pipeline, the template is automatically selected.
- Provide a name for the data integration job.
Configure the source node:
- Review the configuration details of the source node. Based on the data source that you select, the configuration details are fetched from the Data Source stage.
- Click Next.

Note: If the file type selected in the source is Delta Table, the Databricks cluster must have the following Spark configuration parameter added:

set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;
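If you manage the Databricks cluster yourself, you can add this parameter in the cluster's Spark config, or set it for the current session as in the following minimal sketch (assuming a Databricks cluster with Delta Lake available):

```python
from pyspark.sql import SparkSession

# On Databricks, a SparkSession named `spark` already exists; this line is a no-op there.
spark = SparkSession.builder.getOrCreate()

# Enable Change Data Feed by default for Delta tables created in this session.
# The equivalent cluster-level setting is the key/value pair
#   spark.databricks.delta.properties.defaults.enableChangeDataFeed true
# added under the cluster's Spark config.
spark.conf.set(
    "spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true"
)
```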
Configure the target node:
- Datastore - Select the S3 datastore that you want to configure.
- Choose Target Format - Select one of the following target formats:
  - Source Data Format - Select this option if you want to keep the data format in the target the same as in the source.
  - Parquet - Select this option if you want to use the Parquet format for the target data.
  - Delta Table - Select this option if you want to store data in the Delta table format. If you select this option, an additional operation type, Merge, is available for selection.
- Target Folder - Select a target folder on S3. If you select an existing folder, the schema of the data in that folder must match the schema of the source data; otherwise the data integration job fails.
- Target Path - (Optional) Provide a folder name that you want to append to the target folder.
- Operation Type - Append and Overwrite are two distinct operation types when dealing with data files. Choose the type of operation that you want to perform (see the sketch after this list for an illustration of these behaviors):
  - Append - Adds new data at the end of a file without erasing the existing content. If you select the append operation, the schema of the source data and the target data must match; otherwise the data integration job fails.
  - Merge - This operation type is available only if the selected target format is Delta Table. For the first job run, the data is added to the target table. For subsequent job runs, this option merges the data with the existing data whenever the source data changes. The merge operation is performed based on the Unique Key Column that you provide.
  - Overwrite - Replaces the entire content of a file with new data. If you select the overwrite operation, the data integration job does not fail even if the schema of the source data does not match the schema of the target data.
- Enable Partitioning - Enable this option if you want to use partitioning for the target data. Select from the following options:
  - Data Partition - Select the filename and column details, enter the column value, and then click Add.
  - Date Based Partitioning - Select the type of partitioning that you want to use for the target data: Yearly, Monthly, or Daily. Optionally, add a prefix to the partition folder name.
  You can review the final path of the target file, which is based on the inputs that you provide.
- Store Rejected Records - Enable this option if you want to store records that do not meet the defined criteria at a specified location.
  Note: You can store rejected records only if the source node uses the CSV or JSON format.
  Provide the following inputs:
  - Target Folder - Click the pencil icon. On the Target Folder screen, select the folder in which you want to store the rejected records.
  - Target Path - Provide an additional folder to append to the target folder, where you want to store the rejected records.
  - Rejected File Path - The final path for the rejected records, created based on the inputs that you provide.
- Enable the option if you want to publish the metadata to AWS Glue Metastore on S3:
  - Metadata Store - Currently, Lazsa DIS supports AWS Glue.
  - Select a configured AWS Glue Catalog from the dropdown list. See Configuring AWS Glue.
  - Database - The database name is populated based on your selection.
  - Data Location - The location is created based on the selected S3 datastore and database.
  - Select Entity - Select an entity from the dropdown list.
  - Glue Table - Either select an existing Glue table to which the metadata is added, or create a new Glue table for the metadata.
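The append, overwrite, merge, and partitioning options described above can be pictured with plain PySpark and Delta Lake calls. The following sketch is only an illustration of the behavior, not the code that the Lazsa Platform generates; the S3 paths, the `order_date` partition column, and the `order_id` unique key column are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable  # available on Databricks clusters with Delta Lake

spark = SparkSession.builder.getOrCreate()

# Hypothetical source and target locations.
source_df = spark.read.parquet("s3://example-bucket/source/orders/")
target_path = "s3://example-bucket/target/orders/"

# Append: adds new rows; the source and target schemas must match.
source_df.write.format("delta").mode("append").save(target_path)

# Overwrite: replaces the target contents, even when the schemas differ.
(source_df.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save(target_path))

# Partitioning: comparable to the Data Partition / Date Based Partitioning options.
(source_df.write.format("delta")
    .mode("append")
    .partitionBy("order_date")   # hypothetical partition column
    .save(target_path))

# Merge (Delta table targets only): upsert based on the Unique Key Column, here "order_id".
target_table = DeltaTable.forPath(spark, target_path)
(target_table.alias("t")
    .merge(source_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```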
You can select an all-purpose cluster or a job cluster to run the configured job. If your Databricks cluster is not created through the Lazsa Platform and you want to update custom environment variables, refer to the following:

Cluster - Select the all-purpose cluster that you want to use for the data integration job from the dropdown list.
| Cluster Details | Description |
|---|---|
| Choose Cluster | Provide a name for the job cluster that you want to create. |
| Job Configuration Name | Provide a name for the job cluster configuration. |
| Databricks Runtime Version | Select the appropriate Databricks version. |
| Worker Type | Select the worker type for the job cluster. |
| Workers | Enter the number of workers to be used for running the job in the job cluster. You can either have a fixed number of workers or choose autoscaling. |
| Enable Autoscaling | Autoscaling scales the number of workers up or down within the range that you specify. Workers are reallocated to a job during its compute-intensive phase, and the excess workers are removed once the compute requirement reduces. This helps control your resource costs. |
| Cloud Infrastructure Details | |
| First on Demand | Provide the number of cluster nodes that are marked as first_on_demand. The first_on_demand nodes of the cluster are placed on on-demand instances. |
| Availability | Choose the type of EC2 instances (on-demand or spot) used to launch your Apache Spark clusters. |
| Zone | Identifier of the availability zone or data center in which the cluster resides. The provided availability zone must be in the same region as the Databricks deployment. |
| Instance Profile ARN | Provide an instance profile ARN that can access the target Amazon S3 bucket. |
| EBS Volume Type | The type of EBS volume that is launched with this cluster. |
| EBS Volume Count | The number of volumes launched for each instance of the cluster. |
| EBS Volume Size | The size of the EBS volume to be used for the cluster. |
| Additional Details | |
| Spark Config | To fine-tune Spark jobs, provide custom Spark configuration properties as key-value pairs. |
| Environment Variables | Configure custom environment variables that you can use in init scripts. |
| Logging Path (DBFS Only) | Provide the logging path to deliver the logs for the Spark jobs. |
| Init Scripts | Provide the init (initialization) scripts that run during the startup of each cluster. |
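For reference, the job-cluster fields in the table above correspond closely to the cluster specification accepted by the Databricks Jobs and Clusters APIs. The following sketch shows how such a specification could look if you assembled it yourself; all values (runtime version, instance types, ARN, paths, variables) are placeholders, and the Lazsa Platform builds the equivalent configuration for you when it runs the job.

```python
# Hypothetical Databricks job-cluster specification mirroring the fields above.
# All values are placeholders; adjust them to your workspace and AWS account.
job_cluster_spec = {
    "spark_version": "13.3.x-scala2.12",                 # Databricks Runtime Version
    "node_type_id": "i3.xlarge",                          # Worker Type
    "autoscale": {"min_workers": 2, "max_workers": 8},    # Workers, with autoscaling enabled
    "aws_attributes": {
        "first_on_demand": 1,                             # First on Demand
        "availability": "SPOT_WITH_FALLBACK",             # Availability
        "zone_id": "us-east-1a",                          # Zone
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/s3-access",
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",         # EBS Volume Type
        "ebs_volume_count": 1,                            # EBS Volume Count
        "ebs_volume_size": 100,                           # EBS Volume Size (GB)
    },
    "spark_conf": {                                       # Spark Config
        "spark.databricks.delta.properties.defaults.enableChangeDataFeed": "true",
    },
    "spark_env_vars": {"PIPELINE_ENV": "dev"},            # Environment Variables
    "cluster_log_conf": {                                 # Logging Path (DBFS Only)
        "dbfs": {"destination": "dbfs:/cluster-logs"}
    },
    "init_scripts": [                                     # Init Scripts
        {"dbfs": {"destination": "dbfs:/init/install-libs.sh"}}
    ],
}
```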
| SQS and SNS | Description |
|---|---|
| Configurations | Select an SQS or SNS configuration that is integrated with the Lazsa Platform. |
| Events | Select the events for which you want to enable notifications. |
| Event Details | Select the details of the events for which notifications are enabled. |
| Additional Parameters | Provide any additional parameters to be considered for the SQS and SNS queues. |
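After you configure notifications, job events arrive on the selected SQS queue (or SNS topic), where any consumer with access can read them. The following is a minimal polling sketch with boto3; the queue URL and region are placeholders, and the message body depends on the events and parameters that you configure.

```python
import boto3

# Placeholder queue URL; replace it with the SQS queue configured in the Lazsa Platform.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/lazsa-job-events"

sqs = boto3.client("sqs", region_name="us-east-1")

# Long-poll for job event notifications and print their bodies.
response = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,
)
for message in response.get("Messages", []):
    print(message["Body"])
    # Remove the message from the queue once it has been processed.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```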
What's next? Data Integration using Amazon AppFlow